Abstract: This paper presents baseline results for the Third
Facial Micro-Expression Grand Challenge (MEGC 2020). Both
macro- and micro-expression intervals in CAS(ME)² and SAMM
Long Videos are spotted using Main Directional Maximal
Difference Analysis (MDMD). The MDMD method uses the maximal
difference in magnitude along the main direction of optical
flow features to spot facial movements. The single-frame
predictions of the original MDMD method are post-processed
into reasonable video intervals. The baseline results are
evaluated with the F1-score: for CAS(ME)², the F1-scores are
0.1196 and 0.0082 for macro- and micro-expressions
respectively, and the overall F1-score is 0.0376; for SAMM
Long Videos, the F1-scores are 0.0629 and 0.0364 for macro-
and micro-expressions respectively, and the overall F1-score
is 0.0445. The baseline project code is publicly available at
https://github.com/HeyingGithub/Baseline-project-for-MEGC2020_spotting.


I. INTRODUCTION 


Facial expressions are important non-verbal cues that
convey emotions. Macro-expressions are the common facial
expressions of daily life and are the type people usually
recognize. A special type of expression, the
"micro-expression", was first found by Haggard and Isaacs [5].
Micro-expressions (MEs) are involuntary facial movements
that occur spontaneously when a person attempts to conceal
an experienced emotion in a high-stakes environment. MEs are
very short: the general duration is less than 500
milliseconds (ms) [21], [10]. The close connection between
MEs and deception gives the related research great
significance for many applications, such as medical
care [3] and law enforcement [4].

Spotting expressions means finding the moments at which
expressions occur in a whole video sequence. In the Second
Micro-Expression Spotting Challenge (MEGC 2019) [14],
methods for spotting ME intervals in long videos were
explored [7]. Over the past decades, several approaches to
spotting MEs have been explored [12], [20], [18], [17], [16],
[24], [9], [19], [8], [11]. However, MEs are often
accompanied by macro-expressions, and both types of
expressions are valuable for affect analysis. Therefore,
developing methods to spot both macro- and micro-expressions
is the main theme of MEGC 2020.

In this paper, we provide the baseline method and results
for the Third Facial Micro-Expression Grand Challenge
(MEGC 2020), spotting macro- and micro-expression intervals
in long video sequences from the CAS(ME)² and SAMM Long
Videos datasets. The main method is Main Directional Maximal
Difference Analysis (MDMD) [19]. The original MDMD only
predicts whether a frame belongs to a facial movement. To
obtain target intervals, adjacent frames consistently
predicted to be macro- or micro-expressions are grouped into
an interval, and intervals that are too long or too short
are removed. Parameters are adjusted to the specific
expression type and dataset. The performance metric, the
F1-score, is used for evaluation on the two long video
datasets.

The rest of the paper is organized as follows: Section II
presents the methodology and performance metrics.
Section III presents the detailed experimental results.
Section IV concludes the paper.


II. METHODOLOGY 


This section describes the benchmark datasets, the baseline 
method, and the performance metrics. 


A. Datasets 


CAS(ME)² [13]: Part A of the CAS(ME)² database contains
22 subjects and 98 long videos. The facial movements are
classified as macro- and micro-expressions. A video sample
may contain multiple macro- or micro-expressions. The onset,
apex and offset indices of these expressions are given in an
Excel file. In addition, eye blinks are labeled with onset
and offset times.

SAMM Long Videos [22]: The original SAMM dataset [2]
contains 159 micro-expressions and was used for the past
two micro-expression recognition challenges [23], [14].
Recently, the authors [22] released the SAMM Long Videos
dataset, which consists of 147 long videos. There are 343
macro-movements and 159 micro-movements in the long
videos. The indices of the onset, apex and offset frames of
the micro- and macro-movements are given in the ground-truth
Excel file.

More detailed and comparative information of these two 
datasets is presented in Table I. 


B. Baseline method 


TABLE I
A COMPARISON BETWEEN CAS(ME)² AND SAMM LONG VIDEOS.

Dataset            | CAS(ME)²  | SAMM Long Videos
Participants       | 22        | 32
Video samples      | 98        | 147
Macro-expressions  | 300       | 343
Micro-expressions  | 57        | 159
Resolution         | 640 x 480 | 2040 x 1088
FPS                | 30        | 200


1) Preprocess: Expression spotting focuses on facial
regions, so we preprocess every video sample by cropping
and resizing the facial region in all frames. For each video,
we locate the rectangular box that exactly bounds the facial
region in the first frame, and then all the frames of the
video are cropped and resized according to the box located
in the first frame. We locate the bounding box according
to facial landmarks detected by the corresponding function
in the "Dlib" toolkit [6], as we found that applying a face
detection algorithm directly does not perform very well. The
preprocessing details are as follows.

Firstly, we use the landmark detecting function in the
"Dlib" toolkit to obtain 68 facial landmarks on the face
in the first frame of the video, as illustrated in Fig. 1(a),
which shows the first frame of s23_0102 in CAS(ME)². The
landmarks are marked as $L_1, L_2, \cdots, L_{68}$ in the
order of the list returned by the landmark detecting function
in "Dlib", and the corresponding coordinates are marked as
$(x_1, y_1), (x_2, y_2), \cdots, (x_{68}, y_{68})$. The
coordinate system is consistent with the one in the OpenCV
toolkit [1], i.e. the x-axis is the horizontal direction from
left to right, and the y-axis is the vertical direction from
top to bottom. The green dots in Fig. 1(a) are the landmarks,
and some of the serial numbers are marked in red text.

Secondly, in order to form a rectangular box that bounds
the facial region exactly, the leftmost, rightmost, topmost
and bottommost landmarks are marked as $L_l, L_r, L_t, L_b$
with coordinates $(x_l, y_l), (x_r, y_r), (x_t, y_t),
(x_b, y_b)$, respectively. Rather than forming the box
directly according to $L_l, L_r, L_t, L_b$, we form two
points, $A(x_l, y_t - (y_{37} - y_{19}))$ and $B(x_r, y_b)$,
to obtain the box $\mathcal{B}$ with $A$ as the upper
left corner and $B$ as the lower right corner. The coordinate
$y_t - (y_{37} - y_{19})$ means that the upper edge of the
box is moved up by a relative distance to retain more of the
region around the eyebrows. In Fig. 1(a), the box
$\mathcal{B}$ is illustrated by the blue rectangle.

Thirdly, as shown in Fig. 1(b), which is the region in
$\mathcal{B}$, we found that for several subjects in the two
datasets too much area remains at the bottom because of
inaccurate landmark detection. We therefore detect landmarks
again on the region of the first frame in $\mathcal{B}$ to
crop faces more precisely, as shown in Fig. 1(c). This gives
a new bottommost landmark $L_b'(x_b', y_b')$. $B$ is updated
to $B'(x_r, y_{min})$, where $y_{min}$ is the smaller one of
$y_b$ and $y_b'$. Then a new rectangular box $\mathcal{B}'$
is formed with $A$ as the upper left corner and $B'$ as the
lower right corner. In Fig. 1(c), the box $\mathcal{B}'$ is
illustrated by the blue rectangle. The region of the first
frame in $\mathcal{B}'$ is illustrated in Fig. 1(d), in which
the facial region is located better.

Finally, after obtaining the box $\mathcal{B}'$, we crop all
the frames of the video to the rectangular box $\mathcal{B}'$
and thus get the facial regions. The cropped regions are then
resized to 227 x 227.
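
The box construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the baseline project's code: the function names `face_box` and `refine_bottom` are hypothetical, the landmarks are assumed to be already available as a list of 68 (x, y) tuples in Dlib order, the second-pass bottom coordinate is assumed to be expressed in full-frame coordinates, and clamping the upper edge at 0 is our own safeguard.

```python
def face_box(landmarks):
    """Return corner points A (upper left) and B (lower right) of the face box.

    `landmarks` is a list of 68 (x, y) tuples in Dlib order, so the paper's
    1-based landmark L19 is landmarks[18] and L37 is landmarks[36].
    """
    if len(landmarks) != 68:
        raise ValueError("expected 68 landmarks")
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    x_l, x_r = min(xs), max(xs)  # leftmost and rightmost landmarks
    y_t, y_b = min(ys), max(ys)  # topmost and bottommost (y grows downward)
    # Raise the upper edge by the eye-to-eyebrow distance y37 - y19.
    shift = landmarks[36][1] - landmarks[18][1]
    # Clamping at 0 is an assumption so that the box stays inside the image.
    top = max(0, y_t - shift)
    return (x_l, top), (x_r, y_b)


def refine_bottom(box, y_b_second):
    """Update B to B' = (x_r, y_min) after the second landmark detection.

    `y_b_second` is assumed to be given in full-frame coordinates.
    """
    a, (x_r, y_b) = box
    return a, (x_r, min(y_b, y_b_second))
```

Every frame of the video would then be cropped to the refined box and resized to 227 x 227 with an image library of choice.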


2) MDMD: Main Directional Maximal Difference Analysis
(MDMD) was proposed in [19]. The main idea is that when an
expression happens, the face goes through a process of
producing the expression and then returning to a neutral
face, and the main movement directions in the two phases of
this process are opposite. By analyzing this pattern,
expressions can be spotted. Here we review the MDMD
method.

Given a video with $n$ frames, the current frame is denoted
as $F_i$. $F_{i-k}$ is the $k$-th frame before $F_i$, and
$F_{i+k}$ is the $k$-th frame after $F_i$. The robust local
optical flow (RLOF) [15] between the $F_{i-k}$ frame (Head
Frame) and the $F_i$ frame (Current Frame) is computed. We
denote the optical flow by $(u^{HC}, v^{HC})$. For
convenience, $(u^{HC}, v^{HC})$ means the displacement of any
point. Similarly, the optical flow between the $F_{i-k}$
frame (Head Frame) and the $F_{i+k}$ frame (Tail Frame) is
denoted by $(u^{HT}, v^{HT})$. Then, $(u^{HC}, v^{HC})$ and
$(u^{HT}, v^{HT})$ are converted from Euclidean coordinates
to polar coordinates $(\rho^{HC}, \theta^{HC})$ and
$(\rho^{HT}, \theta^{HT})$, where $\rho$ and $\theta$
represent, respectively, the magnitude and direction.

Based on the directions $\{\theta^{HC}\}$, all the optical
flow vectors $\{(\rho^{HC}, \theta^{HC})\}$ are divided into
$a$ directions. Fig. 2 illustrates the condition when $a = 4$.
The Main Direction $\Theta$ is the direction that has the
largest number of optical flow vectors among the $a$
directions. The main directional optical flow vector
$(\rho_M^{HC}, \theta_M^{HC})$ is the optical flow vector
$(\rho^{HC}, \theta^{HC})$ that falls in the Main Direction
$\Theta$:

$$ \{(\rho_M^{HC}, \theta_M^{HC})\} = \{(\rho^{HC}, \theta^{HC}) \mid \theta^{HC} \in \Theta\} \tag{1} $$

Fig. 2. Four directions in the polar coordinates.
The optical flow vector corresponding to
$(\rho_M^{HC}, \theta_M^{HC})$ between the $F_{i-k}$ frame
and the $F_{i+k}$ frame is denoted as
$(\rho_M^{HT}, \theta_M^{HT})$:

$$ \{(\rho_M^{HT}, \theta_M^{HT})\} = \{(\rho^{HT}, \theta^{HT}) \mid (\rho^{HT}, \theta^{HT}) \text{ and } (\rho_M^{HC}, \theta_M^{HC}) \text{ are two different vectors of the same point in } F_{i-k}\} \tag{2} $$

After the differences $\rho_M^{HC} - \rho_M^{HT}$ are sorted
in descending order, the maximal difference $d^i$ is defined
as the mean value of the first 1/3 of the differences
$\rho_M^{HC} - \rho_M^{HT}$, and it characterizes the frame
$F_i$ as in the formula:

$$ d^i = \frac{3}{g} \sum \max_{g/3} \{\rho_M^{HC} - \rho_M^{HT}\} \tag{3} $$

Fig. 1. Diagram of how we obtain facial regions in the preprocessing step: (a) detect facial landmarks and form the rectangular box $\mathcal{B}$; (b) the region in
$\mathcal{B}$; (c) detect facial landmarks in the region in $\mathcal{B}$ and form the rectangular box $\mathcal{B}'$; (d) the region in $\mathcal{B}'$.


where $g = |\{(\rho_M^{HC}, \theta_M^{HC})\}|$ is the number
of elements in the subset $\{(\rho_M^{HC}, \theta_M^{HC})\}$,
and $\max_m S$ denotes a set comprised of the first $m$
maximal elements in the subset $S$.
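
As a rough illustration of the main-direction analysis for one region, the sketch below takes two lists of per-point displacements (head to current, head to tail) and returns the mean of the top third of the magnitude differences in the main direction. It is a minimal sketch, not the baseline implementation: the function names are hypothetical, the sector boundaries (sector 0 centred on angle 0) are an assumption because the exact partition of Fig. 2 cannot be recovered here, and rounding $g/3$ down with a minimum of one element is also an assumption. The optical flow itself (RLOF in the paper) is assumed to be computed elsewhere.

```python
import math


def to_polar(u, v):
    """Convert a displacement (u, v) to (magnitude, angle in [0, 2*pi))."""
    return math.hypot(u, v), math.atan2(v, u) % (2 * math.pi)


def direction_bin(theta, a=4):
    """Index of the angular sector containing theta, for `a` equal sectors.

    The half-sector offset centres sector 0 on angle 0; the exact sector
    boundaries are an assumption.
    """
    width = 2 * math.pi / a
    return int(((theta + width / 2) % (2 * math.pi)) // width) % a


def maximal_difference(flow_hc, flow_ht, a=4):
    """Mean of the top third of magnitude differences in the main direction.

    flow_hc[j] and flow_ht[j] are the (u, v) displacements of the same
    point j (head->current and head->tail, respectively).
    """
    polar_hc = [to_polar(u, v) for u, v in flow_hc]
    bins = [direction_bin(theta, a) for _, theta in polar_hc]
    main = max(range(a), key=bins.count)  # sector holding the most vectors
    diffs = sorted(
        (rho_hc - math.hypot(*flow_ht[j])
         for j, (rho_hc, _) in enumerate(polar_hc) if bins[j] == main),
        reverse=True,
    )
    top = max(1, len(diffs) // 3)  # first third, at least one element
    return sum(diffs[:top]) / top
```

When the face has returned to neutral at the tail frame, the head-to-tail magnitudes are small and the returned value is large, which matches the intuition behind the method.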

Since our method is a block-based analysis, the cropped
facial region of each frame is divided into $b \times b$
blocks, as shown in Fig. 3. We calculate the maximal
difference $d_j^i$ $(j = 1, 2, \cdots, b^2)$ for each block
in the frame $F_i$, so each frame has $b^2$ maximal
differences. We then arrange the $b^2$ maximal differences
$d_j^i$ in descending order, and $\bar{d}^i$, the mean of the
first $s$ maximal differences, characterizes the frame $F_i$:

$$ \bar{d}^i = \frac{1}{s} \sum \max_s \{d_j^i\}, \quad j = 1, 2, \cdots, b^2 \tag{4} $$

Fig. 3. Examples of facial 6 x 6 block structure.
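
The block-level aggregation can be sketched as follows. The helper names are hypothetical, the way a 227 x 227 region is split into 6 x 6 blocks (227 is not divisible by 6) is an assumption, and the value of $s$ is left as a parameter because it is not given in this section.

```python
def block_index(x, y, size=227, b=6):
    """Row-major index of the block containing pixel (x, y).

    Uses proportional integer splitting; the exact block boundaries used
    by the baseline are an assumption.
    """
    col = min(x * b // size, b - 1)
    row = min(y * b // size, b - 1)
    return row * b + col


def frame_feature(block_diffs, s):
    """Mean of the s largest block-wise maximal differences of one frame."""
    if not 1 <= s <= len(block_diffs):
        raise ValueError("s must be between 1 and the number of blocks")
    top = sorted(block_diffs, reverse=True)[:s]
    return sum(top) / s
```

Averaging only the largest block values keeps the feature sensitive to local movements, such as a raised eyebrow, that affect just a few blocks.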


If a person maintained a neutral expression at $F_{i-k}$,
her/his emotional expression, such as disgust, starts at the
onset frame between $F_{i-k}$ and $F_i$, is repressed at the
offset frame between $F_i$ and $F_{i+k}$, and then the facial
expression recovers a neutral expression at $F_{i+k}$, which
is presented in Fig. 4(a). In this circumstance, the movement
between $F_i$ and $F_{i-k}$ is more intense than the movement
between $F_{i+k}$ and $F_{i-k}$ because the expression is
neutral at both $F_{i+k}$ and $F_{i-k}$. Therefore, the
$\bar{d}^i$ value will be large. Another situation is that a
person maintains a neutral expression from $F_{i-k}$ to
$F_{i+k}$. The movement between $F_i$ and $F_{i-k}$ is
similar to the movement between $F_{i+k}$ and $F_{i-k}$;
thus, the $\bar{d}^i$ value will be small. In a long video,
sometimes an emotional expression starts at the onset frame
before $F_{i-k}$ and is repressed at the offset frame after
$F_{i+k}$, which is presented in Fig. 4(b). In this case, the
$\bar{d}^i$ value will also be small if $k$ is set to a small
value. However, $k$ cannot be set to a large value because
this would reduce the accuracy of the computed optical flow.

We employed a relative difference vector for eliminating
the background noise, which was computed by:

$$ r^i = \bar{d}^i - \frac{1}{2}\left(\bar{d}^{i-k+1} + \bar{d}^{i+k-1}\right), \quad i = k+1, k+2, \cdots, n-k \tag{5} $$


Therefore, the frame $F_i$ is characterized by $r^i$. A
threshold is used to obtain the frames that have peaks
representing the facial movements in a video:

$$ threshold = r_{mean} + p \times (r_{max} - r_{mean}) \tag{6} $$

where

$$ r_{mean} = \frac{1}{n-2k} \sum_{i=k+1}^{n-k} r^i \quad \text{and} \quad r_{max} = \max_{i=k+1}^{n-k} r^i. $$

$p$ is a variable parameter in the range [0, 1]. The frames
with $r^i$ larger than the threshold are the frames where
expressions appear.
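
The thresholding step can be illustrated with a short sketch. It assumes the relative differences for the valid frames of one video are already collected in a list; the function name `spot_frames` is hypothetical, and the returned positions are indices into that list rather than absolute frame numbers.

```python
def spot_frames(r, p):
    """Return list positions whose relative difference exceeds the threshold.

    threshold = r_mean + p * (r_max - r_mean), with p in [0, 1].
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not r:
        return []
    r_mean = sum(r) / len(r)
    r_max = max(r)
    threshold = r_mean + p * (r_max - r_mean)
    return [i for i, value in enumerate(r) if value > threshold]
```

With `p` close to 0 the threshold sits just above the mean, so many frames are kept; with `p` close to 1 only frames near the global peak survive.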

3) Parameter settings and post process: In [19], several
parameter combinations are explored to spot
micro-expressions on the CAS(ME)² dataset. For spotting
both macro- and micro-expressions on the two datasets for
MEGC 2020, i.e. CAS(ME)² and SAMM Long Videos, we
select the best combination of blocks and directions explored
in [19], and we set the other parameters according to the FPS
of the two datasets. Moreover, since the original MDMD only
predicts whether a frame belongs to a facial movement, a
post-processing step is added in order to output the target
intervals required by MEGC 2020. The details are as follows.

The number of blocks is set to 6 x 6 and the number
of directions $a$ is set to 4. In the CAS(ME)² dataset,
$k$ is set to 12 for micro-expressions and 39 for
macro-expressions; in the SAMM Long Videos dataset, $k$ is
set to 80 for micro-expressions and 260 for
macro-expressions. Concerning the threshold, $p$ varies from
0.01 to 0.99 with a step size of 0.01, and the final results
are reported under the setting of $p = 0.01$. The original
MDMD only predicts whether a frame belongs to a facial
movement. To output target intervals, adjacent frames
consistently predicted to be macro- or micro-expressions are
grouped into an interval, and intervals that are too long or
too short are removed. The number of micro-expression frames
is limited to between 7 and 16 for the CAS(ME)² dataset, and
between 47 and 105 for the SAMM Long Videos dataset. The
number of macro-expression frames is defined as larger than
16 for the CAS(ME)² dataset, and larger than 105 for the
SAMM Long Videos dataset.

Fig. 4. Two situations: (a) An emotional expression starting at the onset frame between $F_{i-k}$ and $F_i$ is repressed at the offset frame between $F_i$ and
$F_{i+k}$ and recovers a neutral expression at $F_{i+k}$; (b) An emotional expression starting at the onset frame before $F_{i-k}$ is repressed at the offset frame
after $F_{i+k}$.
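
The post-processing step can be sketched as grouping consecutive positive frames and filtering the resulting intervals by length. The function names are hypothetical, and treating the stated frame limits as inclusive bounds is an assumption.

```python
def frames_to_intervals(flags):
    """Group consecutive True entries into (onset, offset) index pairs."""
    intervals = []
    start = None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:  # an interval that runs to the last frame
        intervals.append((start, len(flags) - 1))
    return intervals


def keep_by_length(intervals, min_len, max_len=None):
    """Keep intervals whose frame count lies in [min_len, max_len]."""
    kept = []
    for onset, offset in intervals:
        length = offset - onset + 1
        if length >= min_len and (max_len is None or length <= max_len):
            kept.append((onset, offset))
    return kept
```

For CAS(ME)², for example, micro-expression candidates would be filtered with `keep_by_length(intervals, 7, 16)` and macro-expression candidates with `keep_by_length(intervals, 17)`.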


C. Performance metrics 


In order to avoid the inaccuracy caused by annotation, we 
propose to evaluate the spotting result per interval in MEGC 
2020. 


1. Definition of a true positive in one video


The true positive (TP) per interval in one video is first
defined based on the intersection between the spotted interval
and the ground-truth interval. The spotted interval
$W_{spotted}$ is considered a TP if it satisfies the
following condition:

$$ \frac{W_{spotted} \cap W_{groundTruth}}{W_{spotted} \cup W_{groundTruth}} \geq k \tag{7} $$

where $k$ is set to 0.5 and $W_{groundTruth}$ represents the
ground-truth macro- or micro-expression interval (onset to
offset). If the condition is not fulfilled, the spotted
interval is regarded as a false positive (FP).
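
The overlap condition can be illustrated with a small sketch. Intervals are assumed to be (onset, offset) pairs of inclusive frame indices, which is our own convention, and the function names are hypothetical.

```python
def interval_iou(spotted, truth):
    """Intersection over union of two inclusive (onset, offset) intervals."""
    s_on, s_off = spotted
    t_on, t_off = truth
    inter = max(0, min(s_off, t_off) - max(s_on, t_on) + 1)
    union = (s_off - s_on + 1) + (t_off - t_on + 1) - inter
    return inter / union


def is_true_positive(spotted, truth, min_iou=0.5):
    """Apply the overlap condition with the challenge setting of 0.5."""
    return interval_iou(spotted, truth) >= min_iou
```

For instance, a spotted interval (10, 19) against a ground truth of (12, 21) shares 8 of 12 frames and counts as a TP, whereas against (15, 24) it shares only 5 of 15 frames and does not.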


2. Result evaluation in one video


Suppose there are $m$ ground-truth intervals in the video
and $n$ intervals are spotted. According to the overlap
evaluation, the number of TPs in one video is counted as $a$
($a \leq m$ and $a \leq n$); therefore FP $= n - a$ and
FN $= m - a$. The spotting performance in one video can be
evaluated by the following metrics:

$$ Recall = \frac{a}{m}, \quad Precision = \frac{a}{n} \tag{8} $$

$$ F\text{-}score = \frac{2TP}{2TP + FP + FN} = \frac{2a}{m + n} \tag{9} $$

Yet, videos in real life present some complicated situations
that influence the evaluation of a single video:


• There might be neither macro- nor micro-expressions in the
test video. In this case $m = 0$, so the denominator of
recall would be zero.

• If there are no spotted intervals in the video, the
denominator of precision would be zero since $n = 0$.

• It is impossible to compare two spotting methods when
both TP amounts are zero, because the metric values (recall,
precision and F1-score) all equal zero. However, Method$_1$
outperforms Method$_2$ if Method$_1$ spots fewer intervals
than Method$_2$.


Thus, to avoid these situations, we propose that for the
evaluation of a single video only the amounts of TP, FP
and FN are noted. Other metrics are not considered for one
video.
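
Per-video counting can be sketched as below. The text does not spell out how spotted intervals are paired with ground-truth intervals, so the greedy one-to-one matching used here is an assumption, as are the function name `count_video` and the inclusive (onset, offset) convention.

```python
def count_video(spotted, truths, min_iou=0.5):
    """Return (TP, FP, FN) for one video.

    Each ground-truth interval may be matched by at most one spotted
    interval (greedy matching, an assumption), so TP never exceeds
    either the number of spotted or of ground-truth intervals.
    """
    def iou(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
        union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
        return inter / union

    matched = set()
    tp = 0
    for s in spotted:
        for j, t in enumerate(truths):
            if j not in matched and iou(s, t) >= min_iou:
                matched.add(j)
                tp += 1
                break
    return tp, len(spotted) - tp, len(truths) - tp
```

Videos without any expression or without any spotted interval simply contribute counts such as (0, n, 0) or (0, 0, m), which sidesteps the zero-denominator cases listed above.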

3. Evaluation for the entire dataset


Suppose that in the entire dataset:


• There are $V$ videos including $M_1$ macro-expression
(MaE) sequences and $M_2$ micro-expression (ME) sequences,
where $M_1 = \sum_{i=1}^{V} m_{1i}$ and
$M_2 = \sum_{i=1}^{V} m_{2i}$;

• The method spots $N_1$ MaE intervals and $N_2$ ME intervals
in total, where $N_1 = \sum_{i=1}^{V} n_{1i}$ and
$N_2 = \sum_{i=1}^{V} n_{2i}$;

• There are $A_1$ TPs for MaE and $A_2$ TPs for ME in total,
where $A_1 = \sum_{i=1}^{V} a_{1i}$ and
$A_2 = \sum_{i=1}^{V} a_{2i}$.

The dataset could be considered as one long video. The
results are first evaluated for MaE spotting and ME
spotting separately. Then the overall result for macro- and
micro-expression spotting is evaluated. The recall and
precision for the entire dataset can be calculated by the
following formulas:


• for macro-expressions:

$$ Recall_{MaE\_D} = \frac{A_1}{M_1}, \quad Precision_{MaE\_D} = \frac{A_1}{N_1} \tag{10} $$

• for micro-expressions:

$$ Recall_{ME\_D} = \frac{A_2}{M_2}, \quad Precision_{ME\_D} = \frac{A_2}{N_2} \tag{11} $$

• for overall evaluation:

$$ Recall_D = \frac{A_1 + A_2}{M_1 + M_2}, \quad Precision_D = \frac{A_1 + A_2}{N_1 + N_2} \tag{12} $$


Then, the values of the F1-score for all three evaluations
are obtained based on:

$$ F1\text{-}score = \frac{2 \times (Recall \times Precision)}{Recall + Precision} \tag{13} $$


The champion of the challenge will be the method with the
best overall score for spotting micro- and macro-expressions.
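
The dataset-level scores can be sketched from the accumulated counts. The function names are hypothetical, returning 0.0 when a denominator is zero is an assumption, and the counts in the usage note are made-up illustrative values, not results from this paper.

```python
def prf(a, m, n):
    """Recall, precision and F1-score from TP count a, m true and n spotted."""
    recall = a / m if m else 0.0
    precision = a / n if n else 0.0
    denom = recall + precision
    f1 = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f1


def dataset_scores(macro, micro):
    """macro and micro are (A, M, N) totals accumulated over all videos."""
    a1, m1, n1 = macro
    a2, m2, n2 = micro
    return {
        "macro": prf(a1, m1, n1),
        "micro": prf(a2, m2, n2),
        "overall": prf(a1 + a2, m1 + m2, n1 + n2),
    }
```

With hypothetical totals `dataset_scores((5, 10, 20), (1, 10, 30))`, the overall F1-score equals $2 \times 6 / (20 + 50)$, which agrees with the $2a/(m+n)$ form of the F-score given for a single video.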


III. RESULTS AND DISCUSSION 


For the parameter $p$, we studied the evaluation results
by varying $p$ from 0.01 to 0.99 with a step size of 0.01;
the 20 results from 0.01 to 0.20 are shown in Table II. In
Table II, we list the TPs and F1-scores for macro- and
micro-expression spotting respectively. We observe that, for
both types of expressions in the two datasets, the number
of TPs decreases as $p$ increases. The F1-score also shows a
decreasing trend in SAMM Long Videos. Yet, in CAS(ME)², the
F1-score increases at first and then begins to decrease. The
initial increase of the F1-score in CAS(ME)² is mainly
because the total number of predicted intervals ($n$)
becomes smaller as $p$ increases, making the precision
($a/n$) increase.

Since the amount of TP is an important metric for the
spotting result evaluation, we select the results under the
condition of $p = 0.01$ as the final baseline results. The
details of the final baseline results for spotting macro- and
micro-expressions are shown in Table III. For CAS(ME)²,
the F1-scores are 0.1196 and 0.0082 for macro- and
micro-expressions respectively, and 0.0376 for the overall
result. For SAMM Long Videos, the F1-scores are 0.0629 and
0.0364 for macro- and micro-expressions respectively, and
0.0445 for the overall result. More details about the number
of true labels, TP, FP, FN, precision, recall and F1-score
for the various situations are shown in Table III.


IV. CONCLUSIONS 


This paper addresses the challenge of spotting macro- and
micro-expressions in long video sequences, and provides the
baseline method and results for the Third Facial
Micro-Expression Grand Challenge (MEGC 2020). Main
Directional Maximal Difference Analysis (MDMD) [19] is
employed as the baseline method, and the parameter settings
are adjusted to CAS(ME)² and SAMM Long Videos for
the spotting challenge in MEGC 2020. Slight modifications
are made to the post-processing of results in order to
predict more reasonable intervals. Experiments were
conducted and the predicted results were evaluated using the
metrics of MEGC 2020. The results show that the MDMD method
can produce reasonable performance, but reducing the number
of FPs remains a huge challenge.